On the thermodynamics of DNA methylation process

DNA methylation is an epigenetic mechanism that plays important roles in various biological processes including transcriptional and post-transcriptional regulation, genomic imprinting, aging, and stress response to environmental changes and disease. Consistent with thermodynamic principles acting within living systems and the application of the maximum entropy principle, we propose a theoretical framework to understand and decode the DNA methylation process. A central tenet of this argument is that the probability density function of DNA methylation information-divergence summarizes the statistical biophysics underlying spontaneous methylation background and implicitly bears on the channel capacity of molecular machines conforming to Shannon's capacity theorem. On this theoretical basis, contributions from the molecular machine (enzyme) logical operations to Gibbs entropy (S) and Helmholtz free energy (F) are intrinsic. Application to the estimation of S on datasets from Arabidopsis thaliana suggests that, as a thermodynamic state variable, individual methylome entropy is completely determined by the current state of the system, which in biological terms translates to a correspondence between estimated entropy values and observable phenotypic state. In patients with different types of cancer, results suggest that a significant information loss occurs in the transition from differentiated (healthy) tissues to cancer cells. This type of analysis may have important implications for early-stage diagnostics. The analysis of entropy fluctuations on experimental datasets revealed the existence of restrictions on the magnitude of genome-wide methylation changes originating from organismal response to environmental changes. Only dysfunctional stages observed in the Arabidopsis mutant met1 and in cancer cells do not conform to these rules.

Statistical-physical modeling of the methylation background process. The most probable distribution of methylation states for a DNA molecule, driven by spontaneous/random fluctuations, can be obtained by maximizing the thermodynamic entropy under general system constraints: (i) $\sum_i \pi_i = 1$ and (ii) $\sum_i \pi_i E_i = \langle E \rangle$, where $\pi_i$ is the (discrete) probability to observe dissipation of the energy value $E_i$, and $\langle E \rangle$ is the mathematical expectation of $E$. Under these assumptions, Jaynes' maximum entropy principle (MEP) leads to the Boltzmann distribution as the most probable distribution of the system 18,27. Assuming that the energies $E_i$ dissipated to reach the states $i$ of the system are essentially a continuum, with some density $A(E, \beta, \ldots)$ of methylation changes and energies dissipated $E$, the probability to observe genome-wide energy dissipation between 0 and $E$ can be estimated 28.

Figure caption (fragment). ... 18) leads to the Boltzmann distribution as most probable for the methylation system 18,24. Criteria derived from molecular machine channel capacity and further maximum likelihood estimations lead to the theoretical derivation of a generalized gamma distribution model as best to describe genome-wide methylation changes observable in an individual dataset. This model is expressed in terms of the information divergence of methylation changes $\chi$: $E = \chi k_B T \theta^{-1}$. The state of the methylation system is described by the generalized gamma probability density function, from which an analytical expression for methylation system entropy is derived. Analysis of experimental datasets from Arabidopsis and human cancer allows expression of the fluctuation theorem in a DNA methylation context.

www.nature.com/scientificreports/

Notice that for $A(E, \beta, \ldots) = 1$, the last equation reduces to the classical expression for the Boltzmann distribution. Equation (2) is a general probabilistic model of the methylation background process conforming to an exponential decay law. According to Eq. (2), it is expected that for any case of $f(E \mid \beta, \ldots)$, the probability to observe a methylation change will decline with the increment of energy dissipated per bit of information processed by molecular machines (methyltransferase and demethylase activity). In the following sections, we set out information-thermodynamic constraints on the molecular methylation machinery that permit a maximum likelihood estimation of the function $f(E \mid \beta, \ldots)$.
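As an illustration of the maximum entropy argument, the following minimal sketch computes the Boltzmann distribution for a small set of discrete energy levels and solves numerically for the Lagrange multiplier that satisfies the mean-energy constraint. The energy levels, the target mean, and the function names are hypothetical. The multiplier is written `inv_kt` (that is, $1/(k_B T)$) to avoid confusion with the symbol $\beta$ used elsewhere in this text.

```python
import math

def boltzmann_probabilities(energies, inv_kt):
    """Return p_i = exp(-inv_kt * E_i) / Z for discrete energy levels."""
    e_min = min(energies)  # shift energies for numerical stability
    weights = [math.exp(-inv_kt * (e - e_min)) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]

def mean_energy(energies, inv_kt):
    """Expectation of E under the Boltzmann distribution."""
    probs = boltzmann_probabilities(energies, inv_kt)
    return sum(p * e for p, e in zip(probs, energies))

def solve_inv_kt(energies, target_mean, lo=1e-9, hi=50.0, tol=1e-12):
    """Bisection for the multiplier that meets the mean-energy constraint.

    The mean energy decreases monotonically as inv_kt increases, so the
    target must lie below the unweighted average of the energy levels.
    """
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean_energy(energies, mid) > target_mean:
            lo = mid  # mean too high: need a larger multiplier
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

# Hypothetical energy levels (arbitrary units) and target mean energy.
energies = [0.0, 1.0, 2.0, 3.0]
inv_kt = solve_inv_kt(energies, target_mean=1.0)
probs = boltzmann_probabilities(energies, inv_kt)
```

The resulting probabilities sum to one, reproduce the prescribed mean energy, and decay with increasing energy, which is the qualitative behavior described for the methylation background.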
The channel capacity of methylation machinery. A fundamental constraint to deriving a probability density function of DNA methylation changes involves the physics of information in molecular machine operations. Machine capacity is closely related to Shannon's channel capacity 29 as the maximum amount of information that a molecular machine can gain per operation 20. Following Schneider 20, the machine capacity is bounded by $C = d_{space} \log_2\left(\frac{P_y + N_y}{N_y}\right)$, where $P_y$ is the energy dissipated by a molecular machine, $N_y$ is the energy of the thermal noise, and $d_{space}$ is the number of independently moving parts of a molecular machine involved in the operation 21. Following Shannon 29, the received signals have an energy average $E_y = P_y + N_y$. We shall denote by $E_0 = N_y$ the energy dissipated with probability = 1 and set $d_{space} = \nu - 1$ to arrive at $C_\nu = (\nu - 1)\log_2\left(\frac{E_y}{E_0}\right)$ ($\nu = \alpha\delta$, Supplementary Information (SI) section A), which implies:

Probability density function of the methylation background changes. Equations 1 and 2 quantitatively summarize the statistical physics underlying methylation changes that are not induced by the methylation regulatory machinery. Application of thermodynamic principles to chromatin dynamics tends to maximize Boltzmann entropy, leading to the most probable methylation density states. We sought to maximize the prob- . . . , $p_k$. Two basic assumptions were imposed on $p_i$, $N_i$ and $E_i$: (1) probabilities $p_i$ are proportional to a specific power of the energies $E_i$: (2) for each choice of $\alpha$ the following sum is a positive constant: where $E_i > 0$; the $N_i$ are assumed to be large numbers.
The first assumption derives from the interpretation of channel capacity of molecular machines given by Eq. (3) 20 as $\log_2 p_i \le C_\nu$. The second assumption implies that parameter $\alpha$ carries information about the molecular machine, since $\nu = \alpha\delta$ (SI A). A maximum likelihood estimation of the function $f(E \mid \beta, \ldots)$, on a thermodynamic basis, adapts the Lienhard and Meyer approach 30 to the specific scenario of DNA methylation (provided in SI A). The above assumptions (not given in 30

Assuming that $\frac{E}{k_B T} = \frac{\chi}{\theta}$ ($\chi$ in bit units), the energy dissipated can be estimated as $E = \chi k_B T \theta^{-1}$. According to Landauer's principle, a molecular machine working under ideal conditions dissipates the minimum energy $E = \chi k_B T \ln 2$, with $\theta = \frac{1}{\ln 2}$ in ideal conditions. A more general distribution that includes the location parameter $\mu$ is given as: which has mean: and variance: $\chi(p, q)$ can be expressed in terms of the Hellinger divergence given by Sanchez et al. 9 or in terms of J-divergence 31. The most frequent members of the generalized gamma distribution family found by goodness-of-fit tests for processed bisulfite sequence datasets from different species are the Weibull ($\delta = 1$) and Gamma ($\alpha = 1$) distributions 9,32, obtained as particular cases of the generalized gamma probability density function.
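The special cases named above can be checked numerically. The sketch below assumes the standard Stacy form of the generalized gamma density, $f(\chi) = \frac{\alpha}{\beta\,\Gamma(\delta)} (\chi/\beta)^{\alpha\delta - 1} e^{-(\chi/\beta)^{\alpha}}$, which is an assumption chosen because it reduces to the Weibull density at $\delta = 1$ and to the Gamma density at $\alpha = 1$; it may differ in detail from the parameterization used in this work. The function names and the parameter values are hypothetical. A helper for the Landauer minimum energy $E = \chi k_B T \ln 2$ is included.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant, J/K

def gen_gamma_pdf(chi, alpha, beta, delta):
    """Generalized gamma density in the (assumed) Stacy parameterization.

    alpha: power parameter, beta: scale, delta: shape.
    Returns 0 for chi <= 0, since the support is chi > 0.
    """
    if chi <= 0:
        return 0.0
    z = chi / beta
    norm = alpha / (beta * math.gamma(delta))
    return norm * z ** (alpha * delta - 1) * math.exp(-(z ** alpha))

def weibull_pdf(chi, shape, scale):
    """Two-parameter Weibull density for chi > 0."""
    z = chi / scale
    return shape / scale * z ** (shape - 1) * math.exp(-(z ** shape))

def gamma_pdf(chi, shape, scale):
    """Two-parameter Gamma density for chi > 0."""
    z = chi / scale
    return z ** (shape - 1) * math.exp(-z) / (scale * math.gamma(shape))

def landauer_energy(chi_bits, temperature_k):
    """Minimum energy (J) dissipated to process chi_bits at temperature T."""
    return chi_bits * K_B * temperature_k * math.log(2)

# Hypothetical parameter values used only to compare the special cases.
chi = 0.7
weibull_case = gen_gamma_pdf(chi, alpha=1.8, beta=2.0, delta=1.0)
gamma_case = gen_gamma_pdf(chi, alpha=1.0, beta=2.0, delta=2.5)
```

Under this parameterization, `weibull_case` coincides with `weibull_pdf(chi, 1.8, 2.0)` and `gamma_case` with `gamma_pdf(chi, 2.5, 2.0)`.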
A connection with Shannon's communication theory. As suggested in past reports 17,33, genome-wide patterning of cytosine DNA methylation can occur at specific landmarks, statistically alluding to the existence of a methylation language/code 33,34, where methylation messages are created within the framework of a communication system. In terms of Shannon's communication theory, a communication system can be described by the conditional probability (density) $P_x(y)$, so that if message x is produced by the source, the recovered message at the receiving point will be y 29. Shannon defined the rate $R_1$ of generating information for a given quality of reproduction $v_1 = \iint \rho(x, y) P(x, y)\,dx\,dy$ to be $R_1 = \min_{P_x(y)} \iint P(x, y) \log \frac{P(x, y)}{P(x)P(y)}\,dx\,dy$ at fixed $v_1$ and variable $P_x(y)$.
In Shannon's analysis, the conditional probability $P_y(x)$ that minimizes the rate $R_1$ is given by the expression $P_y(x) = B(x)e^{-\lambda\rho(x, y)}$, where $B(x)$ is chosen to satisfy $\int B(x)e^{-\lambda\rho(x, y)}\,dx = 1$ 29, and $\rho(x, y)$ is a distance function. In this analysis, the function $\rho(x, y)$ behaves as a "distance" between x and y to measure the unlikelihood, based on a fidelity criterion, of receiving y with transmission of x. In the function $B(x)$, the transmitted message x can be expressed at each cytosine site in terms of observed methylation levels in a treatment or a patient group. Methylation levels are estimated as $\frac{nC^m_i}{nC^m_i + nC_i}$, where $nC^m_i$ and $nC_i$ are the number of times the cytosine is methylated and unmethylated at site i, respectively. The received message y can be specified as reference methylation levels, which could be the centroid of a control group or an estimation from an independent subset of control samples from a control population. The function $\rho(x, y)$ can be expressed in terms of a symmetric information divergence $\chi(x, y)$ between the methylation levels x and y. For a fixed reference y, the equality $\chi(x, y) = \chi(x)$ makes it possible to choose $B(x)$ as: where $d\chi = \chi'(x)\,dx$ and $\lambda = \frac{1}{\theta}$. The conditional probability $P_y(x)$, if the recovered message at the receiving point is y and the original message produced by the source is x, can be reinterpreted (after change of variables) as: This equation indicates the probability that, if the recovered message at the receiving point is y, then the information divergence between y and the original message x produced by the source is $\chi$. These applications of Shannon's reasoning lead to the following:

Theorem 1. If an organismal methylation system conforms to a communication system, then optimal methylation messaging is described by Eqs. (13), (9).
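The methylation level estimate and a symmetric divergence between a sample and a reference can be sketched as follows. The divergence shown is the Jeffreys (symmetrized Kullback-Leibler) divergence between two Bernoulli distributions, expressed in bits; this is an assumption made for illustration, and the J-divergence of reference 31 or the Hellinger divergence of reference 9 may be defined or scaled differently. The function names, the clipping constant `eps`, and the read counts are hypothetical.

```python
import math

def methylation_level(n_meth, n_unmeth):
    """Methylation level at a cytosine site: nCm / (nCm + nC)."""
    total = n_meth + n_unmeth
    if total == 0:
        raise ValueError("no read coverage at this site")
    return n_meth / total

def j_divergence(p, q, eps=1e-6):
    """Jeffreys divergence (bits) between Bernoulli(p) and Bernoulli(q).

    Levels are clipped to [eps, 1 - eps] so that the logarithms stay
    finite for fully methylated or fully unmethylated sites.
    """
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    nats = (p - q) * math.log(p / q) + (q - p) * math.log((1.0 - p) / (1.0 - q))
    return nats / math.log(2)

# Hypothetical read counts: treatment site x versus reference site y.
x = methylation_level(n_meth=8, n_unmeth=2)   # 0.8
y = methylation_level(n_meth=3, n_unmeth=7)   # 0.3
chi = j_divergence(x, y)
```

The divergence is zero when the two levels coincide, positive otherwise, and unchanged when x and y are swapped, which is the symmetry required of $\chi(x, y)$ in the text.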
The Gibbs entropy of the system. The Gibbs entropy of a system resulting from methylation changes is defined by the integral: (or simply S, since S(0) = 0) which yields the known analytical expression (SI B): where $\psi(\delta) = \frac{d \ln \Gamma(\delta)}{d\delta}$ stands for the digamma function. After considering Eq. (6), we can write: Thus, the entropy of an individual methylation system is split into a classical term and a contribution from molecular machine activity: A rough estimation of Gibbs entropy for different organismal tissues/cells can be based on the information divergence $\chi_i$ after expressing energy $E_i$ in terms of $\chi_i$ according to Eq. (9): where the term $\varphi(\alpha, \delta) = \psi(\delta)\left(\frac{1}{\alpha} - \delta\right) + \delta$ is a function of a model parameter associated with the number of independent activities of the molecular machine ($\nu = \alpha\delta$).
Since $\log_2 x = \frac{\ln x}{\ln 2}$, Eq. (17) can be written as: The terms in brackets from Eqs. (17) and (17a) (at constant temperature) correspond to the Shannon entropy H, which depends only on the distribution parameters in this case, numerical values that can be estimated from experimental data for each individual. Thus, the Shannon entropy H can be written as: and Following Schneider 26, a decrease in methylome entropy: requires a corresponding decrease in the uncertainty of genome-wide methylation changes: Following a decrease in this uncertainty, the methylome gains information $I_m$: That is, Or expressed in joules per kelvin: Information-theoretical entropy and thermodynamic entropy yield identical outcomes, up to a factor equal to the product of Boltzmann's constant and ln 2, even though they are independent functions 19.
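The stated correspondence between Shannon entropy in bits and thermodynamic entropy can be illustrated with a short conversion. The sketch below assumes that an information gain is computed as the uncertainty before minus the uncertainty after a change, so that an increase in uncertainty gives a negative value (a loss of information); the function names and the example values are hypothetical. The per-mole option replaces Boltzmann's constant with the gas constant.

```python
import math

K_B = 1.380649e-23     # Boltzmann constant, J/K
R_GAS = 8.314462618    # gas constant, J/(K mol)

def bits_to_joule_per_kelvin(h_bits, per_mole=False):
    """Convert entropy in bits to thermodynamic units via k_B * ln 2.

    With per_mole=True the gas constant replaces k_B, giving J/(K mol).
    """
    constant = R_GAS if per_mole else K_B
    return constant * math.log(2) * h_bits

def information_gain_bits(h_before, h_after):
    """Information gained (bits) when uncertainty drops from h_before to h_after.

    A negative result means that uncertainty increased (information loss).
    """
    return h_before - h_after

# Hypothetical uncertainties (bits) for a control and an altered state.
i_m = information_gain_bits(h_before=2.0, h_after=3.0)      # -1.0 bit
i_m_thermo = bits_to_joule_per_kelvin(i_m, per_mole=True)   # J/(K mol)
```

One bit corresponds to $k_B \ln 2$ joules per kelvin per molecule, or $R \ln 2$ joules per kelvin per mole.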
Thermodynamic potential of methylation changes. Assuming that a balance exists between methylation and demethylation processes along each DNA molecule, the overall mass (number of molecules N) and volume (V) of the DNA molecule remain constant. This assumption holds in most experimental datasets since, for large genomic regions, the sum of the differences in methylation level is close to zero. Under this condition, and assuming a constant temperature (T), methylation changes and the micro-environment around them can be treated as a system closed to mass transport but not to energy transfer. In statistical physics, this system is referred to as an NVT system, with the thermodynamic variables N, V, and T held fixed. Helmholtz free energy (F) represents the driving force for NVT systems, the thermodynamic potential that measures "useful" work obtainable from a closed system at a constant temperature and volume. Helmholtz free energy can be estimated from its definition: $F = U - TS$. Assuming that the molecular machine operations do not change the internal energy U of the system, we have $\Delta F = -T\Delta S$, i.e.: The same result derives from the Gibbs free energy definition: $G = H - TS$. Considering that the molecular machine operations do not change the system pressure ($\Delta H = 0$): $\Delta G = -T\Delta S$. Equation (22) roughly estimates how much Helmholtz free energy would be involved in methylation. Rough estimations based on the information divergence $\chi$ can use the approach: where $\beta = k_B T$. Considering Eq. (16), Helmholtz free energy can be split into the classical term and the contribution of molecular machine activities: According to Eq. (7): α . The particular cases of $S_G$ and $F(\beta)$ for Weibull and Gamma distributions are obtained with parameter values $\delta = 1$ and $\alpha = 1$, respectively. Substitution of Eq. (17a) in Eq. (23) yields: At constant temperature, F decreases with the increment of Shannon entropy of the system. The variation of Helmholtz free energy $\Delta F = F_{after} - F_{before}$ between two system states (before and after) can be expressed as: After considering Eqs. (20), (21), and (25), an energetically favorable process is: where a loss of information ($I_m < 0$) will be associated with a loss of free energy $\Delta F < 0$.
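The relation $\Delta F = -T\Delta S$ at constant internal energy can be illustrated with a minimal numerical sketch. Here the entropy change is taken explicitly as the entropy after minus the entropy before the transformation, which is a convention chosen for this example; the function name and the numerical values are hypothetical and are not taken from the tables of this study.

```python
def helmholtz_change(s_before, s_after, temperature_k):
    """Free energy change F_after - F_before at constant U and T.

    Uses delta_F = -T * (S_after - S_before). Entropies in J/(K mol)
    give a result in J/mol.
    """
    return -temperature_k * (s_after - s_before)

# Hypothetical entropies, J/(K mol), at a hypothetical temperature of 300 K.
delta_f = helmholtz_change(s_before=10.0, s_after=12.0, temperature_k=300.0)
is_favorable = delta_f < 0  # entropy increased, so free energy decreased
```

With these values the entropy increases by 2 J/(K mol) and the free energy decreases by 600 J/mol, matching the statement that F decreases as the entropy of the system grows at constant temperature.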
Biological implications of these observations. The theoretical framework presented can be summarized into two biologically intuitive hypotheses: 1. The entropy of methylation variation, measured with respect to some reference, coincides with observable phenotypic change. Thus, entropy provides a highly sensitive measure of organismal epigenetic state. 2. Disruption of methylation machinery will generate large fluctuations in the methylation signal outside of the expected range of fluctuations for normal/healthy tissues.
The first hypothesis rests on the premise that entropy is a thermodynamic state variable of the system, which means that its value is completely determined by the current state of the system and not by how the system reached that state. The second hypothesis presumes that methylation machinery participates in organismal adaptation to environmental changes, and this process requires a non-equilibrium feedback control. To adapt to environmental change, organisms must rely on molecular mechanisms to sense changes and trigger regulatory adaptive responses 35.
To test our hypotheses, we analyzed Arabidopsis thaliana and human methylome datasets. Functions for Gibbs entropy and Helmholtz free energy estimations, as given by Eqs. (17) and (22), respectively, are currently included in the MethylIT R package (see Supporting Information). Entropy was estimated in Arabidopsis thaliana Col-0 ecotypes (wild type controls, WT), the methyltransferase mutant met1 36, and first- and third-generation heritable epigenetic memory states (nm1, mm1, and mm3) that derive as epigenetically modified progeny from a parental line following suppression of MSH1 expression 37.
In plants, CG methylation is maintained by METHYLTRANSFERASE1 (MET1), and mutations that disrupt its activity induce genome-wide hypomethylation in the CG context. Consequently, we expect to observe a significant loss of information in datasets from met1 plants relative to wild type. In the case of the msh1 memory state, heritable epigenetic stress memory is observed following segregation of an MSH1-RNAi transgene, yielding ca. 20% of transgene-null progeny with a heritable memory phenotype of delayed maturation and sustained stress response (mm1, mm3), and the remainder appearing phenotypically unchanged and designated "non-memory" (nm1). The msh1 memory system was described previously 37, and both memory (mm1) and non-memory (nm1) full-sib types display evidence of genome-wide cytosine methylation repatterning relative to wild type. Here, we include analysis of first-generation (mm1) and third-generation (mm3) samples from the same msh1 memory lineage and predict these variants to display a lesser incremental effect on entropy variation than met1. Results shown in Table 1 confirm these predicted outcomes.

The effect of an msh1 suppression line on genome-wide methylation changes in epigenetic memory and non-memory progeny, generations 1 and 3, was reflected in a discrete increment of entropy and, consequently, loss of information: $\Delta S = S_{control} - S_{mutant} < 0$ 26. This observation is further evidence of epigenetic effects that give rise to the memory state 37. Loss of information in the met1 mutant was much greater than in msh1 memory, consistent with the profound effects of genome-wide CG demethylation; CG is the predominant genic methylation context in animals and plants.
Our results suggest that entropy can serve as a highly sensitive measure of the state of an organism. For example, we also observed significant differences in the entropy values for Col-0 wildtype controls WT3 and WT met1 . Although these wildtype controls derive from the same Arabidopsis Col-0 accession, they differ in ontogeny. WT met1 plants were grown under continuous light for 2 weeks in half-strength Gamborg's B5 media, while WT3 plants were grown to maturity on standard peat mix in pots maintained at 12-h daylength and sampled at bolting stage. We consider these differences in plant stage and growth conditions to account for the marked entropy differences observed.
In human cancer studies, Gibbs entropies for different cancer cells and the corresponding healthy tissue/cell controls are presented in Table 2. Outcomes suggest that Gibbs entropy increases for all cancer cells relative to their corresponding normal tissue. Since information divergences were computed with respect to the same reference individual, the observed entropy values suggest that breast metastasis cells underwent the most aggressive loss of information (assuming that experimental errors were not sufficient to affect the estimated values). The relationship between Gibbs entropy and Helmholtz free energy predicts results shown in Table 3. After the methylation reprogramming that transforms differentiated healthy cells to a cancer state, the information potential of cancer cells appears to decrease dramatically relative to healthy cells. These data reflect an important, previously undocumented, means of assessing the state of a biological system. The overall results support our hypothesis that entropy estimation is a highly sensitive measure of organismal epigenetic state.

Table 1. Gibbs entropy (1) estimated in several Arabidopsis mutants and corresponding Col-0 controls (WT).
(1) Entropy values were estimated using Eq. (17) and J-divergence 31. The values are given in J K$^{-1}$ mol$^{-1}$, after replacing the Boltzmann constant by the gas constant.
(2) Loss of information $I_m$ is given by Eq. (20a).
(3) Helmholtz free energy F values were estimated using Eq. (26a) and J-divergence 31. The values are given in J mol$^{-1}$.
(4) Symbols '**' and '***' indicate highly statistically significant differences at p-value < 0.01 and p-value < $10^{-16}$ between mutant or memory state, respectively. Symbol † indicates a Wilcoxon paired test; otherwise testing was conducted applying a linear mixed model.
To test our second hypothesis, we first addressed the inference that in differentiated healthy tissue, the physical work accomplished by the methylation machinery must lead to a decrease in genome-wide methylation uncertainty, reflected in the values of the (dimensionless) entropy $k_B^{-1} S$. This inference is supported by the regression analysis of $k_B^{-1}|S|$ versus $\nu$ accomplished in Arabidopsis and human datasets (Fig. 2a,b). The K-means algorithm was applied to cluster chromosomes from all cancer types into the two groups denoted in Fig. 2 as 'cancer I' and 'cancer II'. Figure 2b shows that a subset of chromosomes from all cancer types appears to transition from a trend relatively close to the healthy state (with negative slope, 'cancer I') to a weakly positive linear trend ('cancer II') in the direction of human embryonic stem cells (HESCs). A positive linear trend was also found in the Arabidopsis met1 mutant (Fig. 2a).
These results provide us with an empirical estimation of the entropy fluctuations through the regression analysis of $e^{-k_B^{-1}|S|}$ versus $e^{-\nu}$ (Fig. 2c,d), which leads to the equation: where $\eta$ is a proportionality constant. Or equivalently: As shown in Eq. (27a), a negative value for model parameter $\eta$ (negative slope) is indicative of nonequilibrium feedback control. In an epigenetic context, nonequilibrium feedback control refers to the control accomplished by epigenetic regulatory machinery such as methyltransferases and demethylases. Figure 2c,d show that only the Arabidopsis met1 mutant, chromosomes of all cancer types, and embryonic stem cells showed a positive slope $\eta > 0$.

where $c \in O(\nu^2)$, which, within the limits of numerical error, approximates a constant not necessarily statistically significant. As shown in Fig. 2e,f, linear regression analysis confirms the statistical trend predicted by Eqs. (28) and (28a). With the exception of extreme conditions found in the Arabidopsis mutant met1 (red points, Fig. 2a,c,e subplots), cancer chromosomes from group II and stem cells (magenta points), the remainder of the data support Eqs. (27) and (28).

Another way to arrive at Eq. (27a) is to consider the average of the sum of Boltzmann's factors $e^{-k_B^{-1}|S|}$ and $e^{-\nu}$. Results suggest that the average sum $e^{-k_B^{-1}|S|} + e^{-\nu}$ appears constant (Fig. 3). No statistical differences were found between the overall means of values from Arabidopsis (Fig. 3a) and humans (Fig. 3b), which leads us to postulate: where $\eta$ has a value close to 1. Thus, we can write $e^{-k_B^{-1}|S|} = 1 - e^{-\nu}$ and, considering nonequilibrium feedback control 38, $e^{-k_B^{-1}|S|} = \eta\left(1 - e^{-\nu}\right)$, which leads to Eq. (27). Small-range fluctuations are expected in normal healthy tissues, while notable fluctuation is expected in tissues/cells experiencing a disruption in methylation regulatory machinery. This last case is found in cancer cells shown in Fig. 3a, where the case of glioma departs substantially from healthy brain tissue and fluctuates at the level of stem cells. In biological terms, Eqs. (27) to (29) imply that the magnitude of genome-wide methylation changes originating in response to environmental change is restricted. Disease would presumably occur by large fluctuations outside the range of expected variation in healthy tissues.
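The regression of one Boltzmann factor on the other can be sketched with ordinary least squares. The example below generates synthetic, noise-free values that follow the relation $e^{-k_B^{-1}|S|} = \eta\left(1 - e^{-\nu}\right)$ quoted above, with a hypothetical $\eta = 0.9$, and then fits a straight line to $e^{-k_B^{-1}|S|}$ against $e^{-\nu}$. The data, the value of $\eta$, and the function name `ols_fit` are assumptions for illustration only; this is not the analysis pipeline used for Figs. 2 and 3.

```python
import math

def ols_fit(xs, ys):
    """Ordinary least squares fit y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical per-chromosome values of nu and a hypothetical eta.
eta = 0.9
nu_values = [0.8, 1.0, 1.3, 1.7, 2.2]

# Dimensionless entropies |S| / k_B generated from the quoted relation.
s_dimensionless = [-math.log(eta * (1.0 - math.exp(-nu))) for nu in nu_values]

# Boltzmann factors used as regression variables.
x_factor = [math.exp(-nu) for nu in nu_values]
y_factor = [math.exp(-s) for s in s_dimensionless]

slope, intercept = ols_fit(x_factor, y_factor)
```

Because the relation is linear in $e^{-\nu}$, the fitted intercept recovers the hypothetical $\eta$ and the fitted slope equals its negative for these synthetic data. With experimental data the fit would carry residual error, and departures from the line would correspond to the fluctuations discussed in the text.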

Discussion
We present a theoretical premise to account for DNA methylation variation behavior. Our results describe the information thermodynamics of cytosine methylation, extending well beyond the simple application of Eq. (9) as the null hypothesis required for methylation analysis. Results confirm that members of the generalized gamma probability distribution family, given by Eq. (6), quantitatively summarize the statistical physics underlying spontaneous methylation variation driven by random fluctuations. Parameters from Eq. (6) carry information about channel capacity of molecular machines 20,21 that relates to Shannon's capacity theorem. Equation (9) can be interpreted as a conditional probability density distribution. The conditional probability interpretation of methylation (Eq. 13) assumes that the message remains constant in the control population and, under conditions of environmental variation or disease, changes in some subpopulation represented in treatment or patient datasets.
The conditional probability density $P_y(\chi)$ indicates that if the recovered message at the receiving point is y, then $P_y(\chi)$ will decline exponentially with the information divergence $\chi(x, y)$ between y and the message x produced by the source. Thus, if DNA methylation conforms to a communication system, then optimal coding of the methylation message is described in Eq. (9).
Methylation changes that support DNA thermal stability are expected to be present in highest frequency and with relatively small divergence values. Observed data from control populations show information divergence values $\chi(x, y)$ to be small, representing the housekeeping or background "noise" in the system. We expect that the probability $P(\chi(x, y) > \chi_{0.95})$ of observing a methylation background fluctuation with a value $\chi(x, y)$ greater than the 95% quantile $\chi_{0.95}$ is less than 0.05 ($P(\chi(x, y) > \chi_{0.95}) = 1 - P(\chi(x, y) \le \chi_{0.95})$). In other words, Eq. (9) can be applied as the null hypothesis in a signal detection-based approach to discriminate the methylation regulatory signal (expected with values $\chi(x, y) > \chi_{0.95}$) from the methylation background 9,32.
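The quantile-based discrimination described above can be sketched for the Weibull member of the generalized gamma family, whose quantile function has a closed form. The shape and scale values below are hypothetical stand-ins for parameters that would be estimated from control data, and the function names are illustrative; this is a sketch of the thresholding idea, not the procedure implemented in any specific package.

```python
import math

def weibull_cdf(chi, shape, scale):
    """P(X <= chi) for a two-parameter Weibull background model."""
    if chi <= 0:
        return 0.0
    return 1.0 - math.exp(-((chi / scale) ** shape))

def weibull_quantile(prob, shape, scale):
    """Inverse CDF: the value chi_p such that P(X <= chi_p) = prob."""
    return scale * (-math.log(1.0 - prob)) ** (1.0 / shape)

def is_regulatory_signal(chi, shape, scale, level=0.95):
    """Flag a divergence that exceeds the chosen background quantile."""
    return chi > weibull_quantile(level, shape, scale)

# Hypothetical background parameters fitted on control samples.
shape, scale = 0.6, 0.05
chi_95 = weibull_quantile(0.95, shape, scale)
tail_prob = 1.0 - weibull_cdf(chi_95, shape, scale)  # probability above threshold
```

Divergences at or below `chi_95` are treated as background, while larger values are candidates for the regulatory signal, with a tail probability of 0.05 under the background model.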
The methylation message is presumably encoded within the mechanical properties of the DNA molecule 1,2 . For example, flexibility or rigidity of the DNA double helix is required for regulating nucleosome folding and transcription factor (TF) binding to DNA sequence motifs 39,40 . Depending on DNA sequence context, the addition or removal of methyl groups to cytosine bases is predicted to alter these local physical properties 1,2 .
Gibbs entropy and Helmholtz free energy, given by Eqs. (17) and (23), suggest a substantial distinction between classical statistical mechanics and the statistical biophysics of the methylation process by considering the entropy contribution from the molecular machine (enzyme) through conformational changes, which is expressed in the term $\varphi(\alpha, \delta)$ from Eq. (17). Application of Eqs. (17) and (23) to experimental datasets can provide important biological insights. Results shown in Table 1 indicate that, as a thermodynamic state variable, the entropy given by Eq. (17) estimates the state of the methylation system consistent with phenotypic observations. The epigenetic memory lines in Arabidopsis produced an incremental effect on information loss observed from nm1 to mm3. A much greater difference in energy (−2228.45 J K$^{-1}$ mol$^{-1}$) was observed between the met1 mutant and its corresponding experimental control, where the minus sign "−" indicates that the transformation was energetically favorable ($\Delta F < 0$) and that a loss of information ($I_m < 0$) occurred in this transformation (Eq. (26a)). Thus, the met1 mutant, which undergoes a genome-wide loss in CG methylation 41, provides a reference for extreme methylation change and information loss (Table 1). Results presented in Tables 2 and 3 are biologically intuitive when considering the transformation of a pluripotent embryonic stem cell to a differentiated cell. The progression from ovule to embryo to multicellular development involves a continuous increase in order, translated to a net gain of information 42,43. We suggest that this phenomenon is reflected in methylome features.
Our data indicate that transformation of normal cells to cancer cells leads to an increase in entropy and, consequently, a loss of information $\Delta S = S_{healthy\ cells} - S_{cancer\ cells} < 0$ 26 ($I_m < 0$). Biological evidence similarly suggests that a loss of information from the original tissue occurs when cancer stem cells, a sub-population from within the tumor mass, derive from cancer cells 44,45. Jointly, results from Tables 1 and 2 are in agreement with these known effects.
Fluctuation constraints revealed by Eqs. (27) to (29) are concerned with preserving the best coding and fidelity of the methylation message at receiver point, permitting sufficient variation of methylation signal to ensure organismal adaptation to environmental change. This concept is supported by the results obtained with the extreme scenarios shown for Arabidopsis mutant met1, cancer samples, and stem cells, where outcomes do not hold to models given in Eqs. (27) to (29). The met1 mutation leads to an almost complete loss of CG gene-body methylation in Arabidopsis and a substantial ectopic CHG and CHH hypermethylation at genes and transposable elements 46 . The methylation reprogramming induced by cancer cells is also well documented 32,47 and the massive loss of information is supported by the results shown in Table 2.
The case of embryonic stem cells is different from met1 mutant and cancer cells. DNA methylation is not necessarily required in embryonic stem cells. Even when CG methylation is completely lost by combined knockout of three mammalian DNA methyltransferases Dnmt1, Dnmt3a, and Dnmt3b, there is a minimal change in phenotype in undifferentiated stem cells 48 .
The experimental finding of Eqs. (27) to (29), as applied to methylome datasets from human and Arabidopsis chromosomes, may be informative about the DNA methylation process and the potential influence of methylation in system buffering. Equation (27) predicts limits in the system's capacity to confront and minimize the effect of random entropy fluctuations. As suggested in Fig. 2, surpassing these limits could reflect system breakdown 49-51.
The connection with Shannon's communication theory reveals a future avenue for application of discrete-state kinetics derived from a Markov model 29 of the information source. A discrete-kinetic approach from the implicit Markov model of the source, and the evolution of such an epigenetic process, can be studied through the corresponding master equations that obey Chapman-Kolmogorov equations. Existence of epigenomic states is not only evident for the observable individual disease and healthy conditions, but also across the aging process 52.
An intricate balance is expected for most epigenetic processes, which can be reversed 53 . That is, unlike DNA mutations, DNA methylation changes and consequent epigenetic alterations are, at least theoretically, reversible 6 . Thus, we can study the epigenomic process across organismal ontogeny as a stationary and ergodic Markov process.
As noted by Gorban 54, "the only difference between the general first order (chemical) kinetics and master equation for the probability distribution is in the balance conditions: the sum of probabilities should be 1, whereas the sum of variables (concentrations) for the general first order kinetics may be any positive number." From this perspective, the methylation regulatory signal, and associated epigenomic processes, reflects a system transitioning between possible stationary states in which an organism must constantly adapt to new environmental conditions. Development of this modeling is beyond the scope of our current study.
The primary goal of this study was to establish a theoretical basis for understanding DNA methylation behavior, but the practical outcomes of entropy estimates suggest that our results may have important implications for early diagnostics and assessing change in organismal state. Results suggest that information loss (entropy increments) and, consequently, DNA methylation reprogramming characterize cancer progression, suggesting that epigenetic mechanisms might be influential in cancer metastasis 55,56 . Our results also suggest that detection of early disease development stages on the basis of physical-informational chromosome states would be feasible.

Materials and methods
Biological experimental datasets. The Arabidopsis thaliana methylome datasets (with results reported in Table 1) from bisulfite sequencing of msh1 memory and non-memory (normal looking) sibling plants with isogenic Col-0 wild-type control in Arabidopsis were downloaded from the Gene Expression Omnibus (GEO) Series GSE129303a and GSE118874.
The methylome datasets for the met1 mutant and corresponding wildtype were taken from the GEO Series GSE122394. The fastq files from the Arabidopsis met1 mutant and corresponding wildtype methylome datasets were downloaded from the European Nucleotide Archive (ENA, https://www.ebi.ac.uk/ena/browser/home). The raw read counts for met1 methylated and non-methylated cytosines for further methylation analysis were obtained as follows: raw sequencing reads were quality-controlled with FastQC (version 0.11.5), trimmed with TrimGalore! (version 0.4.1) and Cutadapt (version 1.15), then aligned to the TAIR10 reference genome using Bismark (version 0.19.0) with bowtie2 (version 2.3.3.1). The deduplicate_bismark function in Bismark with default parameters was used to remove duplicated reads, and reads with coverage greater than 500 were removed to control PCR bias. Methylated Cs (COV files) were acquired from the Bismark methylation extractor with default parameters.

The cancer and healthy tissue controls (Table 2) were downloaded from the GEO Series GSE52271. Blood B-cells CD19 (GSM1279518) was used as reference in the computation of the J-divergences (JD). The Bi-seq datasets of Naive Human Embryonic Pluripotent Stem Cells have GEO accessions GSM2041690, GSM2041691, and GSM2041692.
A more detailed description of these datasets is given in SI B.1.

Computational tools and statistical analysis.
The estimations of J-divergences and of the best nonlinear fitted model among members of the generalized gamma distribution (Eqs. 9 and 11) were accomplished using functions from the MethylIT R package (version 0.3.2.4; https://genomaths.github.io/methylit/). Gibbs entropy and Helmholtz free energy were estimated with the MethylIT functions gibb_entropy and helmholtz_free_energy, respectively. The estimations of the Boltzmann's factors shown in Figs. 2 and 3 were accomplished using the MethylIT function boltzman_factor. All R scripts for the results in Tables 1, 2, and 3 are available as SI.
The group comparison shown in Table 1 was accomplished in the lme4 R package (version 1.1-27.1) applying a linear mixed model with chromosome random effects with formula: entropy = group + (1|chromosome).

Data availability
All the methylome datasets and software used in this work are publicly available at GitHub: https://github.com/genomaths/MethylIT (version 0.3.2.4). As specified in the Materials and methods section (and in the SI), all methylome raw data used in the scripts were downloaded from the GEO or ENA databases. Intermediate datasets used in the downstream analysis to support the conclusions of this report are available on GitLab at Penn State at https://git.psu.edu/genomath/datasets. R scripts to accomplish all the computations are included in the SI, so readers can reproduce all the computations accomplished in this study.